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Abstract 

We address the text-to-text generation prob- 
lem of sentence-level paraphrasing — a phe- 
nomenon distinct from and more difficult than 
word- or phrase-level paraphrasing. Our ap- 
proach applies multiple-sequence alignment to 
sentences gathered from unannotated compara- 
ble corpora: it learns a set of paraphrasing pat- 
terns represented by word lattice pairs and au- 
tomatically determines how to apply these pat- 
terns to rewrite new sentences. The results of 
our evaluation experiments show that the sys- 
tem derives accurate paraphrases, outperform- 
ing baseline systems. 

1 Introduction 

This is a late parrot! It's a stiff! Bereft of life, 
it rests in peace! If you hadn't nailed him to 
the perch he would be pushing up the daisies! 
Its metabolical processes are of interest only to 
historians! It's hopped the twig! It's shuffled 
off this mortal coil! It's rung down the curtain 
and joined the choir invisible! This is an EX- 
PARROT! — Monty Python, "Pet Shop" 

A mechanism for automatically generating mul- 
tiple paraphrases of a given sentence would be 
of significant practical import for text-to-text gen- 
eration systems. Applications include summa- 
rization (Knight and Marcu, 2000) and rewriting 
:kar and Bangalore, 1997 ): both could employ 
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such a mechanism to produce candidate sentence para- 
phrases that other system components would filter for 
length, sophistication level, and so forth. 1 Not surpris- 
ingly, therefore, paraphrasing has been a focus of gen- 

1 Another interesting application, somewhat tangential 
to generation, would be to expand existing corpora by 



eration research for quite some time (McK eown, 1979| 
IMeteerand Shaked, 1988l|Dras, 1999) . 

One might initially suppose that sentence-level 
paraphrasing is simply the result of word-for- 
word or phrase-by-phrase substitution applied in a 
domain- and context-independent fashion. How- 
ever, in studies of paraphrases across several do- 
mains por danskaja, Kittredge, and Polguere ', 1991| 

|Robin, 1994| |McKeown, Kukich, and Shaw, 1994) , this 
was generally not the case. For instance, consider the 
following two sentences (similar to examples found in 
|Smadja and McKeown (1991) ): 



After the latest Fed rate cut, stocks rose across the board. 



Winners strongly outpaced losers after Greenspan cut in- 
terest rates again. 



Observe that "Fed" (Federal Reserve) and "Greenspan" 
are interchangeable only in the domain of US financial 
matters. Also, note that one cannot draw one-to-one cor- 
respondences between single words or phrases. For in- 
stance, nothing in the second sentence is really equiva- 
lent to "across the board"; we can only say that the en- 
tire clauses "stocks rose across the board" and "winners 
strongly outpaced losers" are paraphrases. This evidence 
suggests two consequences: (1) we cannot rely solely on 
generic domain-independent lexical resources for the task 
of paraphrasing, and (2) sentence-level paraphrasing is an 
important problem extending beyond that of paraphrasing 
smaller lexical units. 

Our work presents a novel knowledge-lean algo- 
rithm that uses multiple-sequence alignment (MSA) 
to learn to generate sentence-level paraphrases es- 
sentially from unannotated corpus data alone. In 

providing several versions of their component sentences. 
This could, for example, aid machine-translation evalua- 
tion, where it has become common to evaluate systems by 
comparing their output against a ban k of several reference 
translatio ns for the same sentences jPapineni et al., 2002| . 
See Bangalore, Murdock, and Riccardi (2002 1 and 
|Barzilay and Lee (2002) for other uses of such data. 



contrast to previous work using MSA for genera- 
tion ( |Barzilay and Lee,"20 02 ), we need neither parallel 
data nor explicit information about sentence semantics. 
Rather, we use two comparable corpora, in our case, col- 
lections of articles produced by two different newswire 
agencies about the same events. The use of related cor- 
pora is key: we can capture paraphrases that on the sur- 
face bear little resemblance but that, by the nature of the 
data, must be descriptions of the same information. Note 
that we also acquire paraphrases from each of the indi- 
vidual corpora; but the lack of clues as to sentence equiv- 
alence in single corpora means that we must be more 
conservative, only selecting as paraphrases items that are 
structurally very similar. 

Our approach has three main steps. First, working on 
each of the comparable corpora separately, we compute 
lattices — compact graph-based representations — to 
find commonalities within (automatically derived) groups 
of structurally similar sentences. Next, we identify pairs 
of lattices from the two different corpora that are para- 
phrases of each other; the identification process checks 
whether the lattices take similar arguments. Finally, given 
an input sentence to be paraphrased, we match it to a lat- 
tice and use a paraphrase from the matched lattice's mate 
to generate an output sentence. The key features of this 
approach are: 

Focus on paraphrase generation. In contrast to earlier 
work, we not only extract paraphrasing rules, but also au- 
tomatically determine which of the potentially relevant 
rules to apply to an input sentence and produce a revised 
form using them. 

Flexible paraphrase types. Previous approaches to 
paraphrase acquisition focused on certain rigid types of 
paraphrases, for instance, limiting the number of argu- 
ments. In contrast, our method is not limited to a set of a 
pn'on'-specified paraphrase types. 

Use of comparable corpora and minimal use of 
knowledge resources. In addition to the advan- 
tages mentioned above, comparable corpora can be 
easily obtained for many domains, whereas previ- 
ous approaches to paraphrase acquisition (and the 
related problem of phrase-based machine trans- 
lation (|Wang, 1998| |Och, TiUman, and Ney, 1999) 
|Vogel and Ney, 2000) ) required parallel corpora. We 
point out that one such approach, recently proposed by 
Pang, Knight, and Marcu (2003) , also represents para- 
phrases by lattices, similarly to our method, although 
their lattices are derived using parse information. 

Moreover, our algorithm does not employ knowledge 
resources such as parsers or lexical databases, which may 
not be available or appropriate for all domains — a key 
issue since paraphrasing is typically domain-dependent. 
Nonetheless, our algorithm achieves good performance. 



2 Related work 

Previous work on automated paraphrasing has con- 
sidered different levels of paraphrase granularity. 
Learning synonyms via distributional similarity has 
been well-studied (|Pereira, Tishby, and Lee, 1993 1 



IGrefenstette, 1994| |Lin, 1998). |Jacquemin (*1999> 
and IjBarzilay and McKeown (2001} identify phrase- 
level paraphrases, while [Lin and Pantel (2001) and 
|Shinyama et al. (2002) acquire structural paraphrases 
encoded as templates. These latter are the most closely 
related to the sentence-level paraphrases we desire, 
and so we focus in this section on template-induction 
approaches. 

|Lin a nd Pantel (2001 1 extract inference rules, which 
are related to paraphrases (for example, X wrote Y im- 
plies X is the author of Y), to improve question an- 
swering. They assume that paths in dependency trees 
that take similar arguments (leaves) are close in mean- 
ing. However, only two-argument templates are consid- 
ered. |Shinyam a et al. (2002) also use dependency-tree 
information to extract templates of a limited form (in their 
case, determined by the underlying information extrac- 
tion application). Like us (and unlike Lin and Pantel, who 
employ a single large corpus), they use articles written 
about the same event in different newspapers as data. 

Our approach shares two characteristics with the two 
methods just described: pattern comparison by analysis 
of the patterns' respective arguments, and use of non- 
parallel corpora as a data source. However, extraction 
methods are not easily extended to generation methods. 
One problem is that their templates often only match 
small fragments of a sentence. While this is appropriate 
for other applications, deciding whether to use a given 
template to generate a paraphrase requires information 
about the surrounding context provided by the entire sen- 
tence. 

3 Algorithm 

Overview We first sketch the algorithm's broad out- 
lines. The subsequent subsections provide more detailed 
descriptions of the individual steps. 

The major goals of our algorithm are to learn: 

• recurring patterns in the data, such as X (in- 
jured/wounded) Y people, Z seriously, where the 
capital letters represent variables; 

• pairings between such patterns that represent para- 
phrases, for example, between the pattern X (in- 
jured/wounded) Y people, Z of them seriously 
and the pattern Y were (wounded/hurt) by X, 
among them Z were in serious condition. 

Figure ^ illustrates the main stages of our approach. 
During training, pattern induction is first applied inde- 
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Figure 1: System architecture. 



(1) A Palestinian suicide bomber blew himself up in 

a southern city Wednesday, killing two other people and 
wounding 27. 

(2) A suicide bomber blew himself up in the settlement of 
Efrat, on Sunday, killing himself and injuring seven peo- 
ple. 

(3) A suicide bomber blew himself up in the coastal re- 
sort of Netanya on Monday, killing three other people and 
wounding dozens more. 

(4) A Palestinian suicide bomber blew himself up in a gar- 
den cafe on Saturday, killing 10 people and wounding 54. 

(5) A suicide bomber blew himself up in the centre of Ne- 
tanya on Sunday, killing three people as well as himself and 
injuring 40. 



Figure 2: Five sentences (without date, number, and 
name substitution) from a cluster of 49, similarities em- 
phasized. 



pendently to the two datasets making up a pair of compa- 
rable corpora. Individual patterns are learned by applying 
multiple-sequence alignment to clusters of sentences de- 
scribing approximately similar events; these patterns are 
represented compactly by lattices (see Figure[3}. We then 
check for lattices from the two different corpora that tend 
to take the same arguments; these lattice pairs are taken 
to be paraphrase patterns. 

Once training is done, we can generate paraphrases as 
follows: given the sentence "The surprise bombing in- 
jured twenty people, five of them seriously", we match 
it to the lattice X (injured/wounded) Y people, Z 
of them seriously which can be rewritten as Y were 
(wounded/hurt) by X, among them Z were in serious 
condition, and so by substituting arguments we can gen- 
erate "Twenty were wounded by the surprise bombing, 
among them five were in serious condition" or "Twenty 
were hurt by the surprise bombing, among them five were 
in serious condition". 



3.1 Sentence clustering 

Our first step is to cluster sentences into groups from 
which to learn useful patterns; for the multiple-sequence 
techniques we will use, this means that the sentences 
within clusters should describe similar events and have 
similar structure, as in the sentences of Figure This 
is accomplished by applying hierarchical complete-link 
clustering to the sentences using a similarity metric based 
on word n-gram overlap (n = 1, 2, 3, 4). The only sub- 
tlety is that we do not want mismatches on sentence de- 
tails (e.g., the location of a raid) causing sentences de- 
scribing the same type of occurrence (e.g., a raid) from 
being separated, as this might yield clusters too frag- 
mented for effective learning to take place. (Moreover, 
variability in the arguments of the sentences in a cluster 
is needed for our learning algorithm to succeed; see be- 
low.) We therefore first replace all appearances of dates, 
numbers, and proper names 2 with generic tokens. Clus- 
ters with fewer than ten sentences are discarded. 

3.2 Inducing patterns 

In order to learn patterns, we first compute a multiple- 
sequence alignment (MSA) of the sentences in a given 
cluster. Pairwise MSA takes two sentences and a scor- 
ing function giving the similarity between words; it de- 
termines the highest-scoring way to perform insertions, 
deletions, and changes to transform one of the sentences 
into the other. Pairwise MSA can be extended efficiently 
to multiple sequences via the iterative pairwise align- 
ment, a polynomial-time method commonly used in com- 
putational biology (Durb in et al., 199 8). 3 The results 
can be represented in an intuitive form via a word lat- 
tice (see Figure[3}, which compactly represents (n-gram) 
structural similarities between the cluster's sentences. 

To transform lattices into generation-suitable patterns 
requires some understanding of the possible varieties of 
lattice structures. The most important part of the transfor- 
mation is to determine which words are actually instances 
of arguments, and so should be replaced by slots (repre- 
senting variables). The key intuition is that because the 
sentences in the cluster represent the same type of event, 
such as a bombing, but generally refer to different in- 
stances of said event (e.g. a bombing in Jerusalem versus 
in Gaza), areas of large variability in the lattice should 
correspond to arguments. 

To quantify this notion of variability, we first formal- 
ize its opposite: commonality. We define backbone nodes 



Our crude proper-name identification method was to flag 
every phrase (extracted by a noun-phrase chunker) appearing 
capitalized in a non-sentence-initial position sufficiently often. 

3 Scoring function: aligning two identical words scores 
1; inserting a word scores -0.01, and aligning two dif- 
ferent words scores -0.5 (parameter values taken from 
|Barzilay and Lee (2002"} ). 
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Figure 3: Lattice and slotted lattice for the five sentences from Figure[2] Punctuation and articles removed for clarity. 



as those shared by more than 50% of the cluster's sen- 
tences. The choice of 50% is not arbitrary — it can 
be proved using the pigeonhole principle that our strict- 
majority criterion imposes a unique linear ordering of the 
backbone nodes that respects the word ordering within 
the sentences, thus guaranteeing at least a degree of well- 
formedness and avoiding the problem of how to order 
backbone nodes occurring on parallel "branches" of the 
lattice. 

Once we have identified the backbone nodes as points 
of strong commonality, the next step is to identify the re- 
gions of variability (or, in lattice terms, many parallel dis- 
joint paths) between them as (probably) corresponding to 
the arguments of the propositions that the sentences rep- 
resent. For example, in the top of Figure [5] the words 
"southern city, "settlement of NAME", "coastal resort of 
NAME", etc. all correspond to the location of an event 
and could be replaced by a single slot. Figure [3] shows 
an example of a lattice and the derived slotted lattice; we 
give the details of the slot-induction process in the Ap- 
pendix. 

3.3 Matching lattices 

Now, if we were using a parallel corpus, we could em- 
ploy sentence-alignment information to determine which 
lattices correspond to paraphrases. Since we do not have 
this information, we essentially approximate the parallel- 
corpus situation by correlating information from descrip- 
tions of (what we hope are) the same event occurring in 
the two different corpora. 

Our method works as follows. Once lattices for 
each corpus in our comparable-corpus pair are com- 
puted, we identify lattice paraphrase pairs, using the idea 
that paraphrases will tend to take the same values as 
arguments ( Shinyam a et al., 2002||Lin and Pante~ 2001 1. 
More specifically, we take a pair of lattices from different 
corpora, look back at the sentence clusters from which 
the two lattices were derived, and compare the slot val- 
ues of those cross-corpus sentence pairs that appear in 
articles written on the same day on the same topic; we 
pair the lattices if the degree of matching is over a thresh- 
old tuned on held-out data. For example, suppose we 
have two (linearized) lattices slotl bombed slot2 and 



slot3 was bombed by slot4 drawn from different cor- 
pora. If in the first lattice's sentence cluster we have the 
sentence "the plane bombed the town", and in the sec- 
ond lattice's sentence cluster we have a sentence written 
on the same day reading "the town was bombed by the 
plane", then the corresponding lattices may well be para- 
phrases, where slotl is identified with slot4 and slot2 
with slot3. 

To compare the set of argument values of two lattices, 
we simply count their word overlap, giving double weight 
to proper names and numbers and discarding auxiliaries 
(we purposely ignore order because paraphrases can con- 
sist of word re-orderings). 

3.4 Generating paraphrase sentences 

Given a sentence to paraphrase, we first need to iden- 
tify which, if any, of our previously-computed sentence 
clusters the new sentence belongs most strongly to. We 
do this by finding the best alignment of the sentence to 
the existing lattices. 4 If a matching lattice is found, we 
choose one of its comparable-corpus paraphrase lattices 
to rewrite the sentence, substituting in the argument val- 
ues of the original sentence. This yields as many para- 
phrases as there are lattice paths. 

4 Evaluation 

All evaluations involved judgments by native speakers of 
English who were not familiar with the paraphrasing sys- 
tems under consideration. 

We implemented our system on a pair of comparable 
corpora consisting of articles produced between Septem- 
ber 2000 and August 2002 by the Agence France-Presse 
(AFP) and Reuters news agencies. Given our interest in 
domain-dependent paraphrasing, we limited attention to 
9MB of articles, collected using a TDT-style document 
clustering system, concerning individual acts of violence 
in Israel and army raids on the Palestinian territories. 

4 To facilitate this process, we add "insert" nodes between 
backbone nodes; these nodes can match any word sequence and 
thus account for new words in the input sentence. Then, we per- 
form multiple- sequence alignment where insertions score -0.1 
and all other node alignments receive a score of unity. 



DIRT (Lin and Pantel) 
The 50 common template pairs, sorted by judges' perceived validity 



% of common instances deemed valkh 

% of common + individual instances- 
deemed valid 



judge 1 
judge 2 
judge 3 
judge 4 



=32% (cf. 31%) 
=40% (cf. 43%) 
=42% (cf. 37%) 
=56% (cf. 61%) 



"XI forced out of X2"; "XI ousted from X2" "XI stormed into X2"; "XI thrusted into X2" 



"Xl's candidacy for X2"; "X2 expressed XI 's condemnation" 



MSA 

The 50 common template pairs, sorted by judges' perceived validity 



judge 1 
judge 2 
judge 3 
judge 4 



=76% (cf. 74%) 
=74% (cf. 77%) 
=68% (cf. 79%) 
=96% (cf. 98%) 



"DATE: NUM1 are killed and around NUM2 injured when suicide bomber blows up 



his explosive-packed belt at XI in X2. 



"Palestinian suicide bomber blew himself up at XI in X2 DATE, killing NUM1 and wounding NUM2 police said 

"Palestinian suicide bomber blew himself up at XI in X2 on DATE, killing NUM1 and wounding NUM2, police 
"- DATE: bombing at XI in X2 kills NUM1 israelis and leaves NUM2."; 

"latest violence bring to NUM1 number of people killed as a direct result of the Palestinian [sic], including NUM2 Palestinians and NUM3 israelis.": 
"At least NUM1 Palestinians and NUM2 israelis have been killed since Palestinian uprising against Israeli occupation began in September 2000 after peace talks stalled." 



Figure 4: Correctness and agreement results. Columns = instances; each grey box represents a judgment of "valid" 
for the instance. For each method, a good, middling, and poor instance is shown. (Results separated by algorithm for 
clarity; the blind evaluation presented instances from the two algorithms in random order.) 



From this data (after removing 120 articles as a held- 
out parameter-training set), we extracted 43 slotted lat- 
tices from the AFP corpus and 32 slotted lattices from 
the Reuters corpus, and found 25 cross-corpus matching 
pairs; since lattices contain multiple paths, these yielded 
6,534 template pairs. 5 

4.1 Template Quality Evaluation 

Before evaluating the quality of the rewritings produced 
by our templates and lattices, we first tested the qual- 
ity of a random sample of just the template pairs. In 
our instructions to the judges, we defined two text units 
(such as sentences or snippets) to be paraphrases if one 
of them can generally be substituted for the other without 
great loss of information (but not necessarily vice versa). 
6 Given a pair of templates produced by a system, the 
judges marked them as paraphrases if for many instanti- 
ations of the templates' variables, the resulting text units 
were paraphrases. (Several labelled examples were pro- 
vided to supply further guidance). 

s The extracted paraphrases are available at 
http : / / www .cs.cornell.eclu/Info/Projects/ 
NLP/statpar . html 

6 We switched to this "one-sided" definition because in ini- 
tial tests judges found it excruciating to decide on equivalence. 
Also, in applications such as summarization some information 
loss is acceptable. 



To put the evaluation results into context, we wanted 
to compare against another system, but we are not 
aware of any previous work creating templates pre- 
cisely for the task of generating paraphrases. Instead, 
we made a good-faith effort to adapt the DIRT system 
(Lin and Pantel, 2001 1 to the problem, selecting the 6,534 
highest-scoring templates it produced when run on our 
datasets. (The system of Shinyama et al. (2002 1 was un- 
suitable for evaluation purposes because their paraphrase 
extraction component is too tightly coupled to the under- 
lying information extraction system.) It is important to 
note some important caveats in making this comparison, 
the most prominent being that DIRT was not designed 
with sentence-paraphrase generation in mind — its tem- 
plates are much shorter than ours, which may have af- 
fected the evaluators' judgments — and was originally 
implemented on much larger data sets. 7 The point of 
this evaluation is simply to determine whether another 
corpus-based paraphrase-focused approach could easily 
achieve the same performance level. 

In brief, the DIRT system works as follows. Depen- 

7 To cope with the corpus-size issue, DIRT was trained on 
an 84MB corpus of Middle-East news articles, a strict superset 
of the 9MB we used. Other issues include the fact that DIRTs 
output needed to be converted into English: it produces paths 
like "N:of:N (tide) N:nn:N", which we transformed into "Y tide 
of X" so that its output format would be the same as ours. 



dency trees are constructed from parsing a large corpus. 
Leaf-to-leaf paths are extracted from these dependency 
trees, with the leaves serving as slots. Then, pairs of 
paths in which the slots tend to be filled by similar val- 
ues, where the similarity measure is based on the mutual 
information between the value and the slot, are deemed 
to be paraphrases. 

We randomly extracted 500 pairs from the two algo- 
rithms' output sets. Of these, 100 paraphrases (50 per 
system) made up a "common" set evaluated by all four 
judges, allowing us to compute agreement rates; in addi- 
tion, each judge also evaluated another "individual" set, 
seen only by him- or herself, consisting of another 100 
pairs (50 per system). The "individual" sets allowed us to 
broaden our sample's coverage of the corpus. 8 The pairs 
were presented in random order, and the judges were not 
told which system produced a given pair. 

As Figure 0] shows, our system outperforms the DIRT 
system, with a consistent performance gap for all the 
judges of about 38%, although the absolute scores vary 
(for example, Judge 4 seems lenient). The judges' as- 
sessment of correctness was fairly constant between the 
full 100-instance set and just the 50-instance common set 
alone. 

In terms of agreement, the Kappa value (measuring 
pairwise agreement discounting chance occurrences 9 ) on 
the common set was 0.54, which corresponds to moderate 
agreement ( |Landis and Koch, 1977) . Multiway agree- 
ment is depicted in Figure — there, we see that in 86 
of 100 cases, at least three of the judges gave the same 
correctness assessment, and in 60 cases all four judges 
concurred. 

4.2 Evaluation of the generated paraphrases 

Finally, we evaluated the quality of the paraphrase sen- 
tences generated by our system, thus (indirectly) testing 
all the system components: pattern selection, paraphrase 
acquisition, and generation. We are not aware of another 
system generating sentence-level paraphrases. Therefore, 
we used as a baseline a simple paraphrasing system that 
just replaces words with one of their randomly-chosen 
WordNet synonyms (using the most frequent sense of the 
word that WordNet listed synonyms for). The number of 
substitutions was set proportional to the number of words 
our method replaced in the same sentence. The point of 

8 Each judge took several hours at the task, making it infea- 
sible to expand the sample size further. 

'One issue is that the Kappa statistic doesn't account for 
varying difficulty among instances. For this reason, we actu- 
ally asked judges to indicate for each instance whether making 
the validity decision was difficult. However, the judges gen- 
erally did not agree on difficulty. Post hoc analysis indicates 
that perception of difficulty depends on each judge's individual 
"threshold of similarity", not just the instance itself. 



this comparison is to check whether simple synonym sub- 
stitution yields results comparable to those of our algo- 
rithm. 10 

For this experiment, we randomly selected 20 AFP ar- 
ticles about violence in the Middle East published later 
than the articles in our training corpus. Out of 484 sen- 
tences in this set, our system was able to paraphrase 59 
(12.2%). (We chose parameters that optimized precision 
rather than recall on our small held-out set.) We found 
that after proper name substitution, only seven sentences 
in the test set appeared in the training set, 11 which im- 
plies that lattices boost the generalization power of our 
method significantly: from seven to 59 sentences. Inter- 
estingly, the coverage of the system varied significantly 
with article length. For the eight articles of ten or fewer 
sentences, we paraphrased 60.8% of the sentences per ar- 
ticle on average, but for longer articles only 9.3% of the 
sentences per article on average were paraphrased. Our 
analysis revealed that long articles tend to include large 
portions that are unique to the article, such as personal 
stories of the event participants, which explains why our 
algorithm had a lower paraphrasing rate for such articles. 

All 118 instances (59 per system) were presented in 
random order to two judges, who were asked to indi- 
cate whether the meaning had been preserved. Of the 
paraphrases generated by our system, the two evalua- 
tors deemed 81.4% and 78%, respectively, to be valid, 
whereas for the baseline system, the correctness results 
were 69.5% and 66.1%, respectively. Agreement accord- 
ing to the Kappa statistic was 0.6. Note that judging full 
sentences is inherently easier than judging templates, be- 
cause template comparison requires considering a variety 
of possible slot values, while sentences are self-contained 
units. 

Figure [5] shows two example sentences, one where 
our MSA-based paraphrase was deemed correct by both 
judges, and one where both judges deemed the MSA- 
generated paraphrase incorrect. Examination of the re- 
sults indicates that the two systems make essentially or- 
thogonal types of errors. The baseline system's relatively 
poor performance supports our claim that whole-sentence 
paraphrasing is a hard task even when accurate word- 
level paraphrases are given. 

5 Conclusions 

We presented an approach for generating sentence level 
paraphrases, a task not addressed previously. Our method 
learns structurally similar patterns of expression from 
data and identifies paraphrasing pairs among them using 

10 We chose not to employ a language model to re-rank either 
system's output because such an addition would make it hard to 
isolate the contribution of the paraphrasing component itself. 

"Since we are doing unsupervised paraphrase acquisition, 
train-test overlap is allowed. 



Original (1) 


The caller identified the bomber as Yussef Attala, 20, from the Balata refugee camp near Nablus. 


MSA 


The caller named the bomber as 20-year old Yussef Attala from the Balata refugee camp near Nablus. 


Baseline 


The company placed the bomber as Yussef Attala, 20, from the Balata refugee camp near Nablus. 


Original (2) 


A spokesman for the group claimed responsibility for the attack in a phone call to AFP in this northern West Bank town. 


MSA 


The attack in a phone call to AFP in this northern West Bank town was claimed by a spokesman of the group. 


Baseline 


A spokesman for the grouping laid claim responsibility for the onslaught in a phone call to AFP in this northern West 
Bank town. 



Figure 5: Example sentences and generated paraphrases. Both judges felt MSA preserved the meaning of (1) but not 
(2), and that neither baseline paraphrase was meaning-preserving. 



a comparable corpus. A flexible pattern-matching pro- 
cedure allows us to paraphrase an unseen sentence by 
matching it to one of the induced patterns. Our approach 
generates both lexical and structural paraphrases. 

Another contribution is the induction of 
MSA lattices from non-parallel data. Lat- 
tices have proven advantageous in a number of 
NLP contexts (|M angu, Bri ll, and Stolcke, 2000| 
Bangalore, Murdock, and Riccardi, 2002; 
Barzilay and Lee, 2002||Pang, Knight, and Marcu, 2003) , 
but were usually produced from (multi-)parallel data, 
which may not be readily available for many applica- 
tions. We showed that word lattices can be induced from 
a type of corpus that can be easily obtained for many 
domains, broadening the applicability of this useful 
representation. 
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Appendix 

In this appendix, we describe how we insert slots into 
lattices to form slotted lattices. 

Recall that the backbone nodes in our lattices represent 
words appearing in many of the sentences from which 
the lattice was built. As mentioned above, the intuition 
is that areas of high variability between backbone nodes 
may correspond to arguments, or slots. But the key thing 
to note is that there are actually two different phenomena 
giving rise to multiple parallel paths: argument variabil- 
ity, described above, and synonym variability. For exam- 
ple, Figure |6jb) contains parallel paths corresponding to 
the synonyms "injured" and "wounded". Note that we 
want to remove argument variability so that we can gen- 
erate paraphrases of sentences with arbitrary arguments; 
but we want to preserve synonym variability in order to 
generate a variety of sentence rewritings. 

To distinguish these two situations, we analyze the split 
level of backbone nodes that begin regions with multi- 
ple paths. The basic intuition is that there is probably 
more variability associated with arguments than with syn- 
onymy: for example, as datasets increase, the number of 
locations mentioned rises faster than the number of syn- 
onyms appearing. We make use of a synonymy threshold 
s (set by held-out parameter-tuning to 30), as follows. 

• If no more than s% of all the edges out of a back- 
bone node lead to the same next node, we have high 
enough variability to warrant inserting a slot node. 



• Otherwise, we incorporate reliable synonyms into 
the backbone structure by preserving all nodes that 
are reached by at least s% of the sentences passing 
through the two neighboring backbone nodes. 

Furthermore, all backbone nodes labelled with our spe- 
cial generic tokens are also replaced with slot nodes, 
since they, too, probably represent arguments (we con- 
dense adjacent slots into one). Nodes with in-degree 
lower than the synonymy threshold are removed un- 
der the assumption that they probably represent idiosyn- 
crasies of individual sentences. See Figure [6] for exam- 
ples. 

Figure|5]shows an example of a lattice and the slotted 
lattice derived via the process just described. 

(a) Argument variability 




no more than 2 out of 7 (28%) 
of the sentences lead 
to the same node 



(b) Synonym variability ,. Preserve both mdes . 

3 out of 7 (43%) 
\ of the sentences 
'> lead to these nodes 




\ Arrestee} 



Delete: 

only 1 out of 7 (14%) 
of the sentences lead here 

Figure 6: Simple seven-sentence examples of two types 
of variability. The double-boxed nodes are backbone 
nodes; edges show consecutive words in some sentence. 
The synonymy threshold (set to 30% in this example) de- 
termines the type of variability. 



12 While our original implementation, evaluated in Section|4| 
identified only single-word synonyms, phrase-level synonyms 
can similarly be acquired by considering chains of nodes con- 
necting backbone nodes. 



